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Abstract 

The problem of solving a system of polynomial equations is one of the most fundamental prob¬ 
lems in applied mathematics. Among them, the problem of solving a system of binomial equa¬ 
tions form a important subclass for which specialized techniques exist. For both theoretic and 
applied purposes, the degree of the solution set of a system of binomial equations often plays an 
important role in understanding the geometric structure of the solution set. Its computation, 
however, is computationally intensive. This paper proposes a specialized parallel algorithm for 
computing the degree on GPUs that takes advantage of the massively parallel nature of GPU 
devices. The preliminary implementation shows remarkable efficiency and scalability when com¬ 
pared to the closest CPU-based counterpart. Applied to the “master space problem of Al = 1 
gauge theories” the GPU-based implementation achieves nearly 30 fold speedup over its CPU- 
only counterpart enabling the discovery of previously unknown results. Equally important to 
note is the far superior scalability: with merely 3 GPU devices on a single workstation, the 
GPU-based implementation shows better performance, on certain problems, than a small clus¬ 
ter totaling 100 CPU cores. 
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1. Introduction 


The problem of solving a system of polynomial equations is one of the most fundamen¬ 
tal problems in applied mathematics and science. Among them, the problem of solving 
a system of binomial equations is of special interest for they appear naturally in many 
applications, and specialized and much more efficient algorithm exists (e.g. 125|). In many 
applications, only the solutions of a system of binomial equations for which no variable 
is zero are needed. Such solutions are known as the C*-solution set and will be the focus 
of this article. 

In this article, we propose a parallel algorithm for computing the degree of a C*- 
solution set of a system of binomial equations. This algorithm is specially designed for 
GPU (graphics processing unit) devices by taking advantage of the massively parallel 
nature of GPUs. When applied to a binomial system coming from particle physics, called 
the master space of J\f = 1 gauge theories, this algorithm is able to produce previously 
unknown results. Furthermore, the experimental implementation for GPU built on top 
of the CUDA framework has already shown promising results. Remarkably, with multiple 
GPU devices (on the same computer), the GPU based implementation exhibits much 
better performance, in many cases, than small to medium sized computer clusters. 

This article is structured as follows: First, necessary notations and concepts are in¬ 
troduced. In particular, we shall review basic geometric properties of the C*-solution 
set defined by a binomial system. Then the algorithm for computing the Smith Normal 
Form of an integer matrix is reviewed in §3, as it is an important tool necessary in un¬ 
derstanding the structure of the C*-solution set of a binomial system. The core of this 
article is §4 where a highly scalable parallel algorithm for computing the degree of the 
C*-solution set of a system of binomial equations is presented. A natural by-product 
of the degree computation is a series of homotopy constructions that can be used to 
compute the “witness sets” of components of the C*-solution set of a binomial system, 
which is an important and ubiquitous construction in numerical algebraic geometry. This 
process is explained in §5. The problem of studying the master space of Af = 1 gauge 
theories, arising from string theory, is briefly reviewed in §6, and we show the interesting 
and previously unknown results obtained by applying the parallel algorithm for solving 
systems of binomial equations and computing the degree of the solution set to the master 
space problem. 


2. Laurent binomial systems and its solution set 

First, we shall introduce necessary concepts and notations. For positive integers m and 
n , let M nxm (7j) denote the set of all n x m integer matrices. A square integer matrix is 
said to be unimodular if its determinant is ±1. Note that such a matrix A € M nxn (%) 
has a unique inverse A _1 = de * 4 adj A that is also in M nxn ( Z), where adj A is the adjoint 
matrix of A. The n x n identity matrix in M nxn (Z) is denoted by I n . 

Even though the main application considered in this article are binomial systems, its 
theory is more naturally developed in the context of more general “Laurent binomial sys¬ 
tems” where negative exponents are allowed. For variables x = (x-\.... ,x n ), a Laurent 
monomial in x is an expression of the form aq 1 • • • :r“" where a±,..., a n are integers 
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(which may be zero or negative). For convenience, we shall write a. = (oq,..., a n ) T £ Z™ 
and use the “vector exponent” notation 


x 


OC 



X 


Ot n 

n 


to denote a Laurent monomial. Similarly, for an integer matrix A £ M nxm ( Z) with 
columns W 1 ),..., c* (m ) £ Z n , the “matrix exponent” notation will be used for an m- 
tuple of Laurent monomials: 

A ( a W ••• a < ' rn '>') m (m> 

x A = x\ > ■- (x a ,...,x a ). (1) 

This notation is particularly convenient since the familiar identities x In = x and ( x A ) B = 
x AB still hold. Since the exponents here may be negative, it is only meaningful to consider 
the function x K > x A when we restrict each x t to be nonzero. In particular, throughout 
this article, we shall let Xi £ C* = C \ {0} for each i = 1,..., n. In this case, each matrix 
A £ M nxm (Z) induces a function from ( C*) n to (C*) m given by x H > x A . Of particular 
importance is the function induced by a unimodular matrix A £ M nxn ( Z) since A -1 is 
also in M nxn ( Z), and hence functions x H > x A and x H > x A are the inverses of each 
other (( x A ) A = x AA = x Irl = x). 

A Laurent binomial is an expression of the form ci:c a + C 2 X 13 for some Ci, C 2 £ C* 
and a, (3 £ Z™. This article focuses on the properties of the solution set of systems of 
Laurent binomials equations, or simply Laurent binomial systems, over (C*)™. Stated 
formally, given exponent vectors aW ,..., , (3^,... € Z™ and the coefficients 

c it j £ C*, the goal is to describe the set of all x £ (C*) n that satisfies the system of 
equations 


a (i> 

C 1,1*" 

flt 1 ) 

+ Cl,2® P 

= 0 

Cm,\X a 

0 . fl("0 

+ C m}2 X P 

= 0 


Since only the solutions in (C*) n are concerned, this system is clearly equivalent to 


(x' 


«' -T /i<-> 




) — ( — Ci^/Cip, . . . , —C mj 2/C m ,l)- 


With the more compact “matrix exponent” notation in (1), this system can simply be 
written as 

x A = b or equivalently x A — b = 0 (2) 

where the integer matrix A £ M nxm ( Z), having columns cS^—[3 (yl \ ..., a — (3^ m \ rep¬ 
resents the exponents appeared in the Laurent monomials and the vector b = (—Ci^/cpi, 
..., — Cm, 2 /cm,i) T € (C*) m collects all the coefhcients. The solution set of (2) over (C*) n 
shall be denoted by 

V*(x A -b) = {x£(C*) n \x A -b = 0}. (3) 


The goal of this article is to present efficient parallel algorithms for computing the struc¬ 
tural properties of the set V*(x A — b ): its dimension, number of components, global 
parametrizations, and, most importantly, degree. We shall first briefly review some basic 
facts about the C*-solution set of a Laurent binomial system. A more detailed summary 
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can be found in the article [6] by the first author and Tien-Yien Li. In depth theoret¬ 
ical discussions can be found in standard references such as m 0 Ha Ea EM- Certain 
computational aspects have been studied in |25_i, [26]. 

An important tool in understanding the structure of V* ( x A — b ) is the Smith Normal 
Form of the exponent matrix A £ M nxm (Z): there are unimodular square matrices 
P £ M nxn (Z) and Q £ M mxm (Z) such that 


PAQ = 


d\ 


\ 


\ 


£ M nxm (Z) 


( 4 ) 


0 


with nonzero integers d\ \ d 2 \ ■ ■ ■ \d r for r = rank A, unique up to the signs. Here, a \ b 
means a divides b as usual. This decomposition of the matrix A provides important topo¬ 
logical information about V*(x A — b) C (C*) n summarized in the following proposition: 

Proposition 1 (Topological description [9]). If V*{x A — b) in (C*) ra is not empty, then 
it consists of a finite number of connected components. Furthermore, 

(1) the number of components is exactly rij=i dj ■ 

(2) each solution component has codimension equal to rank A = r. 


This description can be strengthened significantly. Here we shall briefly outline the 
derivation of the stronger description of V*(x A — b) as it provides the important data 
that form the starting point of the degree computation to be discussed in §4. For P and Q 
in the Smith Normal Form of A in (4), let P r £ M rxn (Z ) and Po £ M^ n _ r ^ xn (Z) be the 
top r rows and the remaining n — r rows of P respectively. Similarly, let Q r £ M mxr (Z ) 
and Qq £ M mx ( m _ r )(Z) be the left r columns and the remaining m — r columns of Q 
respectively. With these notations, the Smith Normal Form of (4) of A can be written as 



with D = diag(di,..., d r ) £ M rxr (Z) and 0’s representing zero block matrices of appro¬ 
priate sizes. With this we can transform the binomial system x A = b into a form from 
which more detailed information can be easily extracted. 

Since P and Q are both unimodular the maps z ^ z p and y 1 —>- yQ are both bijections 
on (C*) n and (C*) m respectively. Therefore, considering the solution set in (C*) n , the 
original system x A = b is equivalent to ( x A ) < 5 = x A Q = b®. Similarly, the solution 
sets remain equivalent after the change of variables * = z p , which produces 

(z p ) AQ = z PAQ = z( 08) = (z(o ),z(o)) = b Q = (b Qr , b Qo ) . 
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Since D = diag(di,... ,d r ) £ M rxr (Z), the original system x A = b can now be decom¬ 
posed into a combined system 


) 


= b Qr 

(6) 

= b Qo 

(7) 

: free 

(8) 


where (7) appears when r < m with 1 = (1,..., 1) £ (C*) m-r , and (8) appears when 
r < n. The word “free” in (8) means the system imposes no constraints on the n — r 
variables z r +i,... ,z n . 

Focusing on the above decomposed system, it is clear that if r < to, then the system 
is inconsistent unless 1 = b®°. If the system is consistent (namely, (7) holds), then the 
solutions to (6) are exactly 


' z 1 = e 2klV / dl Ci for k 1 =0,...,d 1 -l 
Z2 = e 2k ^/ d ^ 2 for k 2 =0,...,d 2 -l 


^ _ g2fc r 7r /dj 


C r for k r = 0,..., d r — 1 


(9) 


where each Q is a fixed choice of the dj -th root of j-th coordinate of b®. Clearly, 
all of them are isolated and the total number of these solutions is 1 11^=1 dj | = | detD|. 
If r < n, then the solution set of the decomposed system (6)-(8) in (C*) n breaks into 
“components” of the form {(e 2fcl7r / dl Ci, ■ ■ ■ , e 2kr ™/ dr ( r , z r+ 1, ..., z n ) : (z r+ 1,..., z n ) £ 
(C*)" _r }, and they are in one-to-one correspondence with solutions in (9). Since each 
component is parametrized by the n — r free variables z r+ -\ ,..., z n , it is smooth and of 
dimension n — r. Furthermore, they are disjoint, because these components have distinct 

... ,z r coordinates. 

To translate the above description of the (C*)"-solution set of the decomposed system 
(in z) into a description the original solution set V*(x A — b ), one may simply apply the 
change of variables x = z p . Note that this map and its inverse 2 = x p are both given 
by by monomials ( bi-regular maps |21p. the basic properties of the solution set, such 
as, the number of solution components, their dimensions, and smoothness are therefore 
preserved. To summarize, the above elaborations assert the following proposition. 


Proposition 2 ( Global parametrization [3 l25j 04] ) . For the solution set V*(x A — b) in 
(C*) ra , let P,Q,Qq and D be those matrices appeared in the decompositions of A in (4) 
and (5), and let r = rank A. 

If 1 b®° then the binomial system is inconsistent, and hence its solution set in 

(C*) ra is empty. 

If 1 = b^° then the solution set of x A = b in (C*) n consists of | IIj=i dj\ = \ det ZD| 
connected components Vk lt ...,k r fo r &i £ {0,..., d\ — 1},..., k r £ {0,..., d r — 1}. Each 
component Vk lt ...,k T is smooth of dimension n — r, and it is parametrized by the smooth 
global parametrization 4>k x ,...,k r '■ (C*)( n-r 9 —> 14 ^.. k r given by 


■ ■ ■ ,tn-r) = (e 2kl7T/dl (i, . . . ,e 2kr * /dr (r,th . . . ,tn-r) P ( 10 ) 

where each Q is a fixed choice of the dj -th root of the j-tli coordinate of b®. 
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Note that, as previously stated, in the case of r = n, the solution set V*(x A — b) 
is of dimension n — r = 0, that is, V*(x A — b) consists of isolated points. Then the 
“parametrizations” 4>k x ,...,k T are understood as constants each describes a single isolated 
point. 

As indicated in Proposition 2, for a consistent Laurent binomial system x A = b where 
A £ M nxm (Z) with r = rank(A) < n , each component of the solution set in (C*) Tl will 
be of dimension n — r > 0. In this situation, for both theoretical interests and demands 
from concrete applications, like the Master Space problem to be discussed in §6, one often 
wishes to identify another important property: the degrees of the components. Degree is a 
classic concept developed for plane algebraic curves. For example, the quadratic equation 
y — x 2 =0 defines a curve of degree 2, i.e., the parabola. The generalized notation of 
degree for irreducible algebraic sets is usually formulated algebraically via Hilbert Poly¬ 
nomials. In this article, following the common practice of Numerical Algebraic Geometry, 
we shall take a geometric approach: Let V = Vk lt ...,k r be a component of V*[x A — b) 
for some fixed choice of k \,..., k n as defined in Proposition 2. The number of isolated 
intersection point between V and a “generic” affine space of complementary dimension 
is a fixed number, and this number is the degree of V, denoted by deg V. In algebraic 
terms, we are considering the degree of the projective closure of V. 

Stated more precisely, let Q r be the set of all affine space in C n of dimension r = 
n — dimP. Then it can be shown that in a fixed open and dense subset of Q r , all the affine 
spaces intersect with V at a fixed number of isolated points. This geometric interpretation 
of degree is explained in mmm- 

From a computational standpoint, a generic affine space in Q r can be represented 
by the solution set of a system of d := n — r linear equations with generic coefficients. 
Therefore deg V is precisely the number of points x = (aq,..., x n ) £ V that satisfies the 
system of linear equations 

{ ciiari + C 12 X 2 + ■ ■ ■ + C\ n x n = cio 

I (11) 

Cd,lX\ T Cu2X2 T * * * T CdnXn CdO- 


where for i = 1,..., d and j = 1,..., n are generic complex numbers. But recall that 
the set V = k r is precisely the image of the injective map 

0 fcll ...,fc r (i 1 , ■■■M = (e 2fel7r / dl Ci, ■ • ■, e 2k ^Cr, h,.. -, t d ) p 
in Proposition 2. If we let £ = (e 2fel7r,/dl (i,..., e 2kr ' K l dr and t = then 


where for each j = 1,...,«, pi^ and are the j-th columns of P r and Pq respectively. In 
other words, V has the global parametrization a= £P A) t p ° \ Therefore the intersections 
between V and the generic affine space defined by (11) are precisely the solutions of the 
polynomial system 


' ,„ (1 > 

cn £, p ' t p o 


^Tl( 2 ) ,,/ 2 ) 

' C 12 t^ Pr t p o 


Ch 


,, ^ 


= c 10 


rv (1 > 

C d t p ° 


C d2 ^ t p o’ +... +Cdn Sj p 


(~) 


t p o 


— C d 0 


6 


(j) 

By letting c- := Cij£ Pr G C and c' 0 = cm for each i = 1 ,d and j = 1,... ,n, the 
above is a systems of d polynomial equations in variables t = (ti,...,t d ) with generic 

(j) 

complex coefficients cL and the same set of monomials t p ° . 

Proposition 3 (Degree via affine space cut). If r < n and V*(x A — b) ^ 0, then the 
degree of each component V of V*(x A — b) agrees with the number of solutions t € (C*) d 
of the system of d Laurent polynomial equations 


ClltPo' 

( 2 ) 

+ Ci 2 t P « 


• + Cl n t p 0 ' 

= C10 

Cdltrf' 

( 2 ) 

+ c d2 t p ° 


■' + C dn t p 0 ’ 

= c d 0 


for generic complex coefficients c (J G C. 

It is important to note that for generic coefficients, the C*-solutions of the above 
system are all isolated (O-dimensional), and the total number is a constant. Indeed, in 
[27| . Kushnirenko has shown that this number can be expressed in terms of the volume 
of a geometric object: the Newton polytope of the above system. Here we state the result 
in the context of degree computation and leave the technical statement of Kushnirenko’s 
theorem, as well as its related concepts to Appendix §A. 

Proposition 4 (Degree as volume). For generic choices of the coefficients, the number 
of solutions t G ( C*) d of the system of Laurent polynomial equation (12) and hence the 
degree of V is 

deg V = d\ ■ Vol d (conv{p[ ) 1) ,... ,p^ n) ,0}) (13) 

where 0 = (0,..., 0) T G and columns p^\ ... ,p^ of the matrix Po are considered 
as points in R d . The notation conv denotes the operation of taking convex hull, and Void 
is the volume of a convex body in R d . 

The degree of the solution set of x A = b can be computed efficiently through methods 
in combinatorial geometry. 


3. Parallel Smith Normal Form computation 


As summarized in Equation 2, the key to finding the dimension, number of compo¬ 
nents, and global parametrization of the C*-solution set V* ( x A — b) C (C*) n is the Smith 
Normal Form (4) of the exponent matrix A. In this section, we briefly review the proce¬ 
dure for computing the Smith Normal Form of an integer matrix and then outline the 
parallel modification that is suitable for both multi-core systems and GPU. 

We first briefly review the standard algorithm for computing the Smith Normal Form 
of a matrix with integer entries. One of the classic algorithms for computing the Smith 
Normal Form uses successive row (n) and column (m) reductions of the input matrix, 
as listed in [3 Algorithm 2.4.14] and Q2 Section 8.5.1]: Consider the special case where 
A = ( a \) with 01,02 nonzero, that is, take n = 2 and m = 1. By the Bezout’s identity, 
there exist s and t such that d := gcd(ai, 02 ) = s a\ + 1 02 - Let 


P = 


s 

d 
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Row length 

Speedup ratio 

10 

0.00 

50 

0.00 

100 

1.91 

200 

1.99 

400 

8.01 

800 

14.20 

1600 

22.00 

3200 

31.79 


Table 1. Speedup ratio achieved by a NVidia GTX 780 graphics card (GPU) on a single row 
reduction operation on when compared to an equivalent single-threaded CPU-based implemen¬ 
tation on a Intel Core i7 4770k CPU. In each case, the speedup ratio is computed as the average 
of three runs. Time consumed by data transfer is not computed. 


then det P = sai + tQ2 = ± = 1 and 

sai + ta 2 

CL2CL1 I CL 2 CL 2 
d ' d 

Similarly, for the special case A = ( “1 a 2 ), let Q = ^ ® ^, then AQ = 0j- hr 

general, n x n and m x m version of of the above matrices P and Q can be constructed 
to perform row and column reduction respectively for a n x m integer matrix. 

After repeated such row and column reduction together with potential row and column 
permutations one can construct uninrodular matrices P (1 \ ..., p( fc ) g M nXn ( Z) and 
Q W,..., € M mxm (Z) such that 




p( fe ) • • • P (1) A Q (1 )... qW = ' d T q 

v ■■■„/ 

with r = rank A and d\,... ,d r nonzero. As noted in standard references such as [16] . 
further reduction can ensure d\ \ d 2 \ ■ ■ ■ \ d r , but for the purpose of solving binomial 
system it is not necessary. 

By their design, GPUs are naturally well suited to perform the row and column re¬ 
ductions g0] used in computing the Smith Normal Form. As Table 1 shows, GPUs have 
a clear advantage over CPUs in performing simple row reductions for sufficiently large 
matrices: over 30 fold speedup can be achieved. §6 shows the result of this algorithm 
when applied to the master space problem. 















4. Parallel degree computation 


When the solution set consists of positive dimensional components, Proposition 4 
provides a computationally viable means for computing the degree of each component as 
the normalized volume of a convex polytope. In this section we shall present a parallel 
algorithm for computing the degree that is suitable for both multi-core systems and 
GPUs, though the focus is the GPU-based implementation. Throughout this section, let 
V be a component of V*(x A — b) C (C*) n , and let d = dim V = n — r. Here we shall 
focus on the case where d > 0. Let P° = (p^\ ..., p^) G Md X n( Z) be the matrix 
appears in the Smith Normal For of A (4). Considering each p^ 11 as a point in R d , let 
S = {Pq\ ■ • ■ ,Pq 1) } C R d be the finite point set. Then by Proposition 4, 


deg V = d ! Vold(conv S). 


(14) 


For brevity, let NVol^ = d ! Vol be the normalized volume function in R d , then the 
above equation can be written as 


deg V = NVold(convS') 


(15) 


Therefore the task of computing the degree for V is equivalent to the computation of the 
normalized volume of a lattice polytope (a polytope whose vertices have integer coordi¬ 
nates). 

Remark 1. Clearly, (14) and (15) are equivalent. However, from the computational 
point of view, there is one crucial distinction: The knowledge that NVold(S) must be an 
integer permits the use of efficient but potentially less accurate numerical methods using 
floating point arithmetic and still obtain the correct result. Indeed, the exact results 
can still be obtained as long as the total absolute error is kept below 1/2. This is not 
possible for methods that are designed to compute volume of more general polytopes. The 
algorithm, for computing (15), to be presented below, is hence not directly comparable 
to exact volume computation algorithms mm for general polytopes. 

Our parallel algorithm for the degree computation is developed based on the paral¬ 
lel “mixed cell enumeration” algorithm presented in [5], (See Remark 2 below) Among 
many different approaches for computing the normalized volume, here we adopt a tech¬ 
nique known as regular simplicial subdivision [31] . This approach produces an impor¬ 
tant byproduct that will be used in the computation of witness set, which will be the 
subject of §5. In this approach, we are interested in computing the normalized volume 
NVol a r(conv S) by dividing the lattice polytope convS into a collection of smaller pieces 
for which the volume computation is easy. 

Definition 1 . A cell of S is simply an affinely independent subset of S. A simplicial 
subdivision of S' is a collection V of cells of S, such that 

(1) For each C GT>, convd is a d-sinrplex inside convS; 

(2) For any distinct pair of simplices C\ , C% £ V , the intersection of convCi and 
convC^, if nonempty, is a common face of the two; and 

(3) The union of convex hulls of all cells in T> is exactly conv S. 
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A simplicial subdivision plays an important role in computing NVol^conv S): the 
normalized volume of a d-simplex in R d is easy to compute: given a d-simplex A = 
conv{a 0 ,... ,a d } C Z d , 

NVol d (A) = det ^a 0 ... a fc ) • (16) 

So the volume of conv S can be computed easily as the sum of the volume of all simplices 
in V. 

Note that the simplicial subdivision for a given polytope is, in general, not unique, 
and there are many different approaches for constructing them. Here we focus on the 
approach of regular simplicial subdivision: One can define a “lifting function” ui : S —> R 
by assigning a real number to each point in S. For each point a £ S, a new point 
(a,w(a)) € R ra can be created by using u>(a) as an additional coordinate. This procedure 
“lifts” points of S into R d+1 , the space of one higher dimension. Let 

S = {a = ( a,u>(a )) | a £ 5} (17) 

be the lifted version of S via the lifting function u>. Figures la and lb show examples of 
this lifting procedure.. Let 7r : R d+1 —> R d be the projection that simply erases the last 
coordinate, then 7 t(S) = S. 

Recall that for a face F of the lifted polytope conv S , its inner normal is a vector 
a £ R d+1 such that the linear functional {*,&) attains its minimum over conv A on F. 
Moreover, a face F of conv S is called a lower face with respect to the projection 7r if its 
inner normal a has positive last coordinate. Without loss of generality, in this case, we 
may assume the last coordinate of a to be 1, that is, a = (oq,..., a n , 1) £ R d+1 . It can 
be shown that for almost all choices of the lifting function u> : S — > R, the projections 
of all the d-dimensional lower faces of conv S via 7r form a simplicial subdivision for 
conv S which is called a regular simplicial subdivision of conv S. The construction of this 
simplicial subdivision is therefore equivalent to the enumeration of all the lower faces of 
conv S. 


Example 1. Consider, for example, S = {(0,0), (0,1), (1,1), (1,0)} in the xy-plane. A 
simplicial subdivision of conv S can be obtained via the following procedure: First one 
assign “liftings” wi, W 2 , a> 3 , 0 J 4 £ R to each of the vertices as the ^-coordinate and obtain 
new points (0,0, uq), (0,1, w 2 ), (1,1, w 3 ), (1, 0, uq) in R 3 . It is easy to verify that with 
almost all choices of the liftings the four “lifted” points (Figure la) do not lie on the 
same plane. In that case, the convex hull conv{(0, 0, uq), (0,1, w 2 ), (1,1, w 3 ), (1,0, oq)} of 
the four lifted vertices form a three dimensional polytope (Figure lb) with triangle faces. 
Of particular importance is the lower hull of this polytope which are the faces facing 
downward. As shown in Figure lc, the projection of the faces in the lower hull back onto 
the xy-plane form a simplicial subdivision of the original shape conv S. 


Algebraically speaking, a d-dimensional lower face of conv S is the convex hull of a set 
of d + 1 points {do) • ■ •, C S for which there exists a a = (a, 1) £ R d+1 such that 
the system of inequalities 


I(a 0 , ...,a d ) 


( (d 0 , a) = (Oj,&) for j = 1 ,... ,d 

\ ( a 0l a) < (a, a) for a £ S 


(18) 


is satisfied. In other words, the existence of the lower face defined by {do,..., a d } C S 
is equivalent to the feasibility of the above system of inequalities I(ao,... ,a d ). This 
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(a) Lifted set S (b) Lifted polytope conv S (c) Projection of lower hull 

Fig. 1. Regular simplicial subdivision via generic lifting 

algebraic description of the lower faces is the basis on which enumeration methods are 
developed. While a brute-force approach of checking all the possible combinations of d+1 
points in S against the system of inequalities (18) may be possible, the combinatorial 
explosion will likely render it impractical for all but the most trivial cases. 

In the following subsections, we shall present an approach that results in a parallel 
algorithm which is suitable for both multi-core systems and GPU devices. In this ap¬ 
proach, we employ two complementing processes of “extension” and “pivoting”. We shall 
outline them below. 

Remark 2 (Connection to existing works). The approach developed in this article is a 
natural continuation of a rich web of works on the “mixed cell enumeration” problem 
initiated by the seminal work [24]. The degree computation problem can be considered 
as a special case of the mixed cell enumeration problem, and the connection is explained 
in §A. Active development in the algorithmic aspects of this problem can be found in 
works such as lananaEaisiiBSBsiEi. A broad survey of this topic can be found in 

m- 

The “pivoting” process (for “mixed cell enumeration”), to be described below, was 
proposed in M- However, in terms of performance, it was quickly eclipsed by the “ex¬ 
tension” process developed in m , he and (35* [S3- I n the present article, the pivoting 
and the extension processes are combined as we believe the complementing duo offers 
much better scalability which is crucial in the GPU-based implementations. This is con¬ 
firmed by the numerical experiments, to be presented in §6. 

The graph-theoretic view of the “cell enumeration” process, adopted in this article, 
was originally developed in m and, independently, in ( 551 . The parallelization of the 
algorithm follows the same general idea attempted in [5], but it is modified, in this 
article, to adapt to the massively parallel GPU architectures. 

4-1- Extension of k-faces 

Intuitively speaking, in the extension process, one starts with the vertices of the lower 
hull of conv S. For each of these vertices, systematic attempts are made to “extend” 
it by finding another lower vertex so that the two vertices form a “lower edge” (an 
edge on the lower hull of conv S ). The possible extensions may not be unique, and for 
each possibility, further attempts are made to extend it to 2-dimensional lower faces. 
This process continues until one reaches all the d-dimensional lower faces. Finally, the 
collection of such d-dimensional lower faces will project down, via 7r, to form a simplicial 
subdivision for conv S. 

To describe this process, we first extend the characterization (18) to include lower faces 
of all dimensions: A set of affinely independent k + 1 points in S is said to determine 
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Fig. 3. A direct graph of possible lower ft-faces colored by the feasibility of the corresponding 
system of inequalities. 


a lower fc-face if their convex hull form a fc-dimensional lower face of conv5 with 
respect to the projection 7 r. Stated algebraically, the affinely independent set {a 0 ,..., a 
determines a lower fc-face if and only if there exists an at = (a , 1) £ R rf+1 , such that the 
system of inequalities 


I(a Q ,...,a k ) 


( (d 0 , at) = (a^dt) for j = 1 ,..., k 

\ (d 0 , a) < (a, at) for a £ S 


(19) 


is satisfied. 

Clearly, a lower 0-face is a vertex on the lower hull of conv S. Similarly, a lower 1-face 
is simply a lower edge. We can conveniently organize all possible system of inequalities of 
the above form into a directed acyclic graph, as illustrated ine Figure 2, where each node 
represents a system of inequalities and there is an edge from 1(A) to 1(B) whenever B 
is obtained by joining a new points in S into A. With this construction, the resulting 
graph is graded by the number of points involved. 

It can be easily verified that for generic lifting function w, containment relation be¬ 
tween lower k -faces of the same dimension is impossible. That is, for a fixed k, no lower 
&;-face is contained in another lower fc-face. Therefore the the graph describes precisely 
the containment relationship among possible lower fc-faces. 

A node is said to be feasible if the corresponding system of inequality is feasible. 
Figure 3 shows an example of the labeling of the graph via the feasibility of the nodes: 
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dark for infeasible nodes and white for feasible ones. Recall that a node determines to 
lower fc-faces if and only if it is feasible. Hence we only need to explore of the feasible 
subgraph (the white subgraph in Figure 3). 

One crucial observation is that if two points do not define a lower edge, then they 
cannot be a part of any lower faces of dimension greater than 1. More generally, if a set 
of points does not define a lower fc-face, then there are no lower j-faces containing them 
for any j > k. Stated formally, for F\ C S, 

I{Fi) is infeasible =>■ /(U) is infeasible for all F\ C F C S. (20) 

In terms of the graph, if a node is infeasible, then the entire subgraph reachable by that 
node is infeasible. 

Therefore during the exploration of the graph, once an infeasible node is encountered, 
no further exploration from that node is needed as all nodes reachable are infeasible. 
This simple observation produces significant savings in terms of computation. 

A key procedure in the exploration of the feasible subgraph is the jump from one 
feasible node to another along an edge. Assuming, for some {a 0 ,..., a j,} C S, the node 
I(a o,..., a k ) is feasible, then the feasibility of an adjacent node, say via the edge a k+ i, 
can be determined by solving the linear programming problem 

Minimize (dfc + i,Q:) — (do, a) subject to 
LP(ao,...,a k -,a k + i) : (d 0 , a) = {aj, a) for j = 1 ,..., k (21) 

(do, a) < (a, a) for all a € S 

with the variable a = (a, 1) for a G R d . 

Note that under the constraints, the value of the objective function must be non¬ 
negative. Indeed, the minimum value of 0 is attainable precise when there is an a for 
which the constraints are satisfied, simultaneously to {a k+ i,a) = (a 0 ,a). That is, min¬ 
imum value is 0 if and only if /(a.o,..., a k , a k+ 1 ) is feasible. In this case, a new node 
/(a 0 ,..., a k , a k+ i) is discovered. Geometrically, we have “extended” the lower fc-face 
determined by {d 0 ,..., a k } into a lower (k + l)-face by joining it the new vertex a k+ \. 

Using the extension procedure as a basic building block, we shall discuss, in §4.3, we 
shall discuss the complete parallel algorithm for the exploration of the feasible subgraph. 

4-2. Simplicial pivoting 

In the above we have described a process that gradually explore the feasible subgraph 
via extension procedures. This process is complemented by another process which we shall 
call “simplicial pivoting” which explore the feasible subgraph by “moving sideways” in 
the graph from one lower d-face to another. 

This process starts with a lower d-face of conv<S 
already obtained. Consider, for example, one of the 
lower face shown in Figure 4. Using an edge as a 
hinge, we shall “pivot” one lower face until another 
lower face is obtained. More generally, recall that a 
lower d-face is determined by a set of d + 1 affinely 
independent points {&o, ...,Od} in <S that has an 

Fig. 4: Simplicial pivoting pro¬ 
cedure moves from one lower 
13 face to another 







inner normal of the form a = ( a , 1) with a £ R d+1 . 
Stated algebraically, the system of inequalities 


(d 0 ,a) = (dj,d) for j = 1,..., d 
(dg, a) < (a, a) for all a € S' 


( 22 ) 


is satisfied. Note that the d equalities form a system of d linearly independent constrains 
on a £ R d and hence uniquely determines a. By removing a single equality from the 
above system, we give the inner normal at one degree of freedom which would allow it to 
“pivot”. The goal is to let it pivot until it defines a different lower d-face. 

For any choice i = 0,..., d, with the equality corresponding to di in the above system 
(22) removed, the inner normal a = (a, 1) £ R d , now with one degree of freedom, is 
characterized by the system 

{ (d 0 , at) = (dj,a) for j = 1,..., d, but j ^ i 
(d 0 , a) < (a u at) (23) 

(a 0 , at) < (a, at) for all a £ S 

Note that this system has d— 1 equalities. If a solution with d equalities exists, then that 
solution corresponds to a different lower d-face. In the context of Linear Programming, 
such a solution is called a basic feasible solution. The problem of finding a basic feasible 
solution is known as the Phase One problem in Linear Programming. It can be solved 
exactly and efficiently. 

This procedure is called “simplicial pivoting”. It allows us to pivot from one lower 
d-face to another. By repeatedly applying this procedure, more lower d-faces can be 
gathered. Figure 5 illustrates this process. 



(a) A simplex 



(b) A simplex obtained by 
single pivoting operation 


(c) Another simplex obtained 
through further pivoting oper¬ 
ation 



Fig. 5. Via the pivoting procedure, one moves from a simplex (a) to a different simplex (b) by 
leaving a chosen vertex and then to yet another simplex (c). 


4-3. Traverse the feasible subgraph 

In the above we have formulated the enumeration of lower d-faces as the problem 
of exploring the feasible subgraph of which the lower d-faces is a subset. We also have 
two procedures for “walking” within the graph: The extension procedure moves from 
one lower face to another of one higher dimension while the pivoting procedure jumps 
from one lower d-face to another lower d-face. With these building blocks in place, the 
exploration can be handled by classic graph traversal algorithms which we shall briefly 
review for completeness. 

Most graph traversal algorithms follow a “discover-explore” procedure with proper 
book keeping [4T1 : They gradually explore the graph node by node through the connection 
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between them while keeping track of the nodes visited so that no node is explored twice. 
For a single node, such an algorithm is divided into the discover and explore stages: 
a node is first discovered, and then its connections to other yet unknown nodes are 
explored. Clearly, each node only needs to be visited once. That is, one only needs to 
explore a spanning tree of the graph, (a subgraph that contains all the vertices but is a 
tree in structure), so some mechanism must be used to prevent a node from being visited 
twice. To keep track of the nodes as they are being visited, we assign each task a dynamic 
marker - its state. A node can be in one of the following three states: 
undiscovered The initial status of every node. In this state, the existence of the node 
is completely unknown to us. 

discovered The existence of the node is known, but its connections to other nodes are 
not yet explored. 

completely-explored The existence of the node is known and its connections to other 
nodes have been fully explored. 

Obviously, a node cannot be completely-explored before it is first discovered, so in the 
course of the algorithm, the state of vertices progresses from undiscovered to discovered 
to completely-explored. This point of view also reveals the parallelism in such algorithms: 
nodes on different branches of the spanning tree can be explored in parallel, while con¬ 
secutive nodes on a single branch must be discovered and explored in order. To start the 
algorithm, an initial set of nodes are generated by some other means (bootstrapping). 
The algorithm then discovers other nodes through their connections. From these newly 
discovered nodes the algorithm can discover even more node. This will continue as a 
self-sustaining process until all connected vertices are visited. 

A complete algorithm also need a data structure to keep track of the discovered but 
not yet completely explored vertices (bookkeeping). In the present work, a queue is used. 
It is a linear data structure where newly discovered nodes are added to the back-end 
of the queue. The use of the queue structure essentially imposes an implicit ordering 
of “first-in-first-out”, that is, nodes discovered first are explored first. In the context of 
graph traversal algorithms, this is referred to as a breadth-first strategy in exploring the 
feasible subgraph. Experiments, presented in [5j , suggest that a more flexible ordering of 
nodes within the queue may provide better performance, scalability, and memory usage. 
However, for simplicity, in this work, only the breadth-first approach has been studied. 
The detail of this class of algorithms can be found in standard textbooks such as [3T|. In 
§4.5, we list the pseudo code. 

4-4- Checking for duplicated discovery 

One important problem we must deal with, in the parallel algorithm, is that same 
nodes may be discovered by different threads at the same time. Since the degree is 
the sum of the normalized volume of the projection of all the lower d-faces which are 
represented by the nodes in the feasible subgraph, duplicated nodes will produce incorrect 
results. Therefore, the mechanism for ensuring no duplicated lower d-faces are listed is 
the key to the correctness of the algorithm. 

This mechanism appears to be the bottleneck, in terms of performance, of the original 
algorithm |14| for enumerating “mixed cells” using mainly the pivoting process which is 
one of the main inspiration of the present work (see Remark 2). Our experiments confirm 
that an inefficient checking mechanism would be the limiting factor of the scalability in 
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a parallel implementation. Since on a GPU, it is typical to have thousands of threads 
active simultaneously, the efficiency of such a mechanism is crucial. 

In the present work, the hash table data structure is used to keep track of the nodes, 
in the graph, that have been discovered or completely-explored. The great advantage of 
this choice is that unlike a sorted data structure, hash table provides nearly constant 
access time, in most cases. In our current implementation, for simplicity, the well-tested 
bit-string hash function from the standard C++ library is used. 

Our experiments suggest that a hash table with 2 16 to 2 20 entries is sufficient for all 
problems considered in §6 in the sense that the collision rate in hash table access can be 
virtually ignored. 

4-5. Summary of the algorithm 

In the above, we formulate the degree computation for solution components defined by 
binomial systems as the exploration of the feasible subgraph to be accomplished by the 
two complementing processes: extension and pivoting. In this section, we list the main 
algorithms. 

These algorithms are designed for a system with one or more GPU devices and a 
single CPU with the GPU performance most of the computation intensive tasks. For 
simplicity, we restrict ourselves to modern GPUs manufactured by NVidia and build our 
program based on NVidia CUDA (a GPU programming framework). All the GPU devices 
must share memory since they must all have access to data structures WaitingNodes, 
KnownNodes, and NewNodes. In the current implementation, this is accomplished via a 
technique known as pinned memory j40l provided by the CUDA framework. 

In the following algorithms, the list WaitingNodes contains nodes whose feasibility 
are to be determined by the extension procedure. Cells is the unordered collection of 
lower d-faces already discovered. KnownNodes is the hash table that record the discovery 
of nodes, and it is crucial mechanism by which we ensure the uniqueness of the discovered 
nodes. Finally, NewNodes is an unordered list that keeps track of nodes discovered through 
pivoting or extension. They need to be checked against KnownNodes for uniqueness. 

Random is a function that randomly choose an item from a collection using pseudo 
random number generator. The randomness is employed to achieve a more uniform per¬ 
formance from one run to another which simplifies the benchmarking process. Simplex- 
PhaseOne and SimplexPhaseTwo are the phase-one and phase-two algorithm of the 
simplex method for the linear programming problems (21) and (23) respectively. Even 
though, at over 3000 lines, the C++ code for these two components are the longest and 
most complicated parts of the entire program, they have been a fixture of the long line 
of “mixed volume computation” software developed over the last two decades whence 
the present work inherits much of its techniques and design. Therefore we choose to not 
describe them in detail and refer to works including 0 Q31 ng EH ES9 [381 [39]. 

The Extend procedure tests the feasibility of a node (see §4.1) in the waiting list 
WaitingNodes, and it is designed to run simultaneously on all available threads across 
all GPU devices, 
i: function Extend 
2: if WaitingQueue 0 then 

3: {a 0 , ■ • • , cq.} <— Dequeue( WaitingQueue ) 

4 : F <- SimplexPhaseTwo( LP({a 0 ,..., a fc }) ) 
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5: if F 0 then 

6: NewNodes <r- NewNodes U {P} 

7: end if 

8: end if 

9: end function 

The Pivot procedure implements the simplicial pivoting process detailed in §4.2, and 
it is designed to run simultaneously on all available threads across all GPU devices. It 
picks a random lower d-face already discovered and apply simplicial pivoting to poten¬ 
tially obtain a new lower faces. Just like the Extend procedure above, newly discovered 
nodes will be place in the NewNodes list, 
i: function Pivot 
2: if Cells ^ 0 then 

3: {do, ■ • ■, cbd} t— Random( Cells ) 

4: £ <r- min(d + 1,10) 

5: for i = 1,..., l do 

6: j <- RANDOM( {0,...,d} ) 

7: F <r- SlMPLEXPHASEONE( P({d 0 , • • • , Cid} \ {%}) ) 

8: if P ^ 0 then 

9: NewNodes -f- NewNodes U {P} 

10: end if 

11: end for 

12: end if 

13: end function 

The procedure CheckUniq checks newly discovered nodes against the hash table 
KnownNodes to make sure they have not already been discovered. It will run on a GPU 
device with a large number of threads simultaneously checking the uniqueness of all nodes 
in the list of NewNodes. 
i: function CheckUniq 
2: if NewNodes ^ 0 then 

3: {do, ■ • ■, CLk} t— Dequeue( NewNodes ) 

4: if {do, ■ a k} ^ KnownNodes then 

5: KnownNodes = KnownNodes U {F} 

6: if k = d + 1 then 

7: Cells = Cells U {{do,..., dfc}} 

8: else 

9: for all d £ S \ {do,..., dfc} do 

10: if {do,..., dfc, d} ^ KnownNodes then 

11: WaitingQueue = WaitingQueue U {{do,..., dfc, d}} 

12: end if 

13: end for 

14: end if 

15: end if 

16: end if 

17: end function 

Finally, the main procedure, which runs on the CPU, coordinates all the different 
processes. 
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l: function Main 

2: WaitingNodes <— S 

3: while WaitingNodes /0 do 

4: Run Extend on available GPU threads 

5: Run Pivot on available GPU threads 

6 : Wait for Extend and Pivot 

7: Run CheckUniq on available GPU threads 

8: end while 

9: end function 


5. Computation of witness sets 


The concept of “witness sets” (32', ; 33] is one of the most fundamental and versatile 
tool in numerical algebraic geometry. In its most basic form, given a pure dimensional 
algebraic set, it can be shown that its intersection with a generic affine space of com¬ 
plementary dimension consists of finitely many isolated points. This finite set is called a 
witness set of the algebraic set. It can be used to compute, among many other things, 
the irreducible decomposition and primary decomposition numerically. In many scenar¬ 
ios, it produces the degree of each component as a byproduct. Indeed, this technique (via 
witness sets) was first used to numerically compute the degrees of the “Master Space” 
problem in the work [22] , 

Given the ubiquity of the use of witness sets in numerical algebraic geometry, in this 
section, we shall briefly outline a homotopy construction for computing witness sets for 
a component of V*(x A — b). It is a special case of the polyhedral homotopy [23] , 

Recall that by Proposition 3, the intersection between a component V C V*[x A — b) 
and a generic affine space of complementary dimension consists of precisely the points 
t = (ti,...,td) € (C*) d that satisfy the system of d Laurent polynomial equation in d 
variables given by 

( 1 ) ( 2 ) (») 

Cut p ° + C 12 t p 0 +... + Clnt Po = Cw 


(24) 


„(!) 


J 2 ~> 


+ C dn t P ° = c d0 


C dl t P ° + C d2 t p 0 

where the coefficients depends on both the choice of the component in V* ( x A — b) and 
the choice of the r-dimensional affine space. 

Reusing the notations from §4, let S = {p^,... ,Pq”’}, and let to : S —» R be the 
generic lifting function used for constructing regular simplicial subdivision of conv S in 
§4. Without loss of generality, we can pick w to have images only in Q. With these, we 
introduce a new variable s and consider 


' Cll t p o V (p o > + ■•■ + c ln t p o s w( *o ) - c 10 s“ (p o = £ ags Cl , a t a s‘ 


H(t,s) = < 


j(a) 


(25) 


*P (1) a 
Cdlt P ° S 




A") 

+ Cdnt 0 S' 




— CdOS 


MO) _ 


E, 


r , 

1 6 S«,ol' * 


which is constructed by multiplying each term in (24) by a rational power of the new 
variable s whose exponent is determined by the lifting function ui : S Q. Clearly, 
Hit, 1) = 0 is exactly the system (24) which we aim to solve (inside (C*) d ). As s varies, 
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however, H represents a continuous deformation of the system (24), or a homotopy. The 
central idea behind the homotopy continuation method for solving systems of equations 
is the deformation of a system into a “starting system” which one can solve easily. Then 
numerical continuation methods are employed to trace the movement of the solutions of 
the starting system under the deformation toward the solutions of the original system 
which one aims to solve. 

The key here is to find an appropriate starting system that can be easily solved. As 
is, H(t, 0) cannot be used as the starting system since at s = 0, the system is either 
identically zero or undefined. Therefore certain transformation is necessary to produce a 
meaningful and solvable starting system. Such transformations are given by the regular 
simplicial subdivision discussed in §4. 

Still let T> be a regular simplicial subdivision obtained by the algorithm presented in 
§4.5. Recall that each cell in V is a projection of a cell of the form {do,..., d^} such that 
convjdo,..., a d } is a lower d-face of conv<S that is characterized by (22). That is, there 
exists a (unique) vector of the form a = (oq,..., ad, 1) such that 

(d 0 , a) = (a.j, a) forj = l,...,d 
(do, a) < (d, a) for all a £ S 

Using a = (an,..., ad, 1), we shall consider the change of variables 

fti =yis ai 

t=l i (27) 

[t d =ydS ad 

with which H becomes 

[ E o6S Cl,a!/ a S <a ' a)+ “ (a) = EaeS^V a S^ 
H(t,s) = H(yis ai ,...,y d s ad ,s) = <j | 

[e a& s c d, a y a s {a ’ a)+ul{a) = E ae5 Cd,a 2 / a S<“’“> 

Let /3 = (do, a) and define a new homotopy 

[^EaesCl,ay“^“’ d> 

H a ’P(y,s) = s~^H( yi s a \. m y d s a *, s) = J | 

{s^E a& s^, a y a s^ 

Note that the new homotopy still has the necessary property that 1) = 0 is 

identical to the system (24) which we aim to solve. 

One important observation here is that, by (26), there are precisely d+ 1 terms in each 
component of H a, P(y, s ) having no power of s (the terms corresponding to do,..., a d ), 
and all other terms have positive powers of s. Consequently, at s = 0, terms with positive 
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powers of s vanish, leaving only 


Cl ,a o y a ° 

+ ci, ai y ai 

+ • • 

' -+Cl : a d y ad 

= 0 

C2,a a y a ° 

+ c 2 , ai y ai 

+ • • 

■ ■+C 2 ,a. d y ad 

= 0 

cd, ao y a ° 

+ c d , ai y ai 


■ ■ + Cd, ad y ad 

= 0 


To simplify the notation, let 


( c b 


C = 


a 0 ' ' ' c l,a d 


\Cd,a 0 ' ' ’ c d,ad J 

then the above equation can be written as 



C ■ (y r ) T = o. 


( 28 ) 


(29) 


For generic choices of the coefficients, there exists a nonsingular matrix G £ M dX d( C) 
such that 




GC = 


b 21 


C 


* 

22 


(30) 


V <4 < 4 / 

for some c* ? £ C*. Then without altering its solution set, (29) can be transformed into 
the equivalent system 


r* n a ° 
c iiV 


GC(t l ) 1 = 


c* 21 V ai 


+ c\ 2 y ad = o 
+ c* 2 y a * = 0 


c di V 


+ c* d2 y ad = 0 


(31) 


which is also a Laurent binomial system. Therefore, the algorithm outlined can then be 
used to solve this system. The solutions are precisely the solutions of the starting system 
(28) for the homotopy H a '^. Then numerical continuation techniques can be applied to 
trace the solutions toward s = 1 producing solutions to the target system (24), which 
will be points in the witness set of the component V of V*(x A — b). 

Recall that the construction of the homotopy H a - 13 depends on a cell in the regular 
simplicial subdivision T> of conv S. It is typical for T> to contain more than one cell. In this 
case, each cell induce a different homotopy of the form of H a ^. The above construction 
is a special case of the polyhedral homotopy |24j . and its theory guarantees that as one 
go through all the cells in V , the resulting homotopies of the form H a ' /j will find all the 
points in the witness set of V. 
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6. Master space of AT — 1 gauge theories 


In this section, we consider a system arising from theoretical physics, in particular, 
string theory. A central area of current research in string theory is the study of the vac¬ 
uum moduli space, which, roughly speaking, is the space of continuous solutions (or the 
affine algebraic variety) of a multivariate nonlinear function, called the superpotential 
of the theory under consideration. Here, the vacuum moduli spaces are spaces of special 
holonomy such as Calabi-Yau or Gi manifolds. Different positive-dimensional compo¬ 
nents of the vacuum moduli space correspond to different particle branches, such as 
mesonic, baryonic, etc. Symbolic algebraic geometry methods have been used to study 
the complicated structures of the vacuum moduli spaces of various string theory models 
nzinmra. However, the methods are known to run out of the steam for even moderate 
sized systems due to the algorithmic complexity issues. Recently, numerical algebraic ge¬ 
ometry methods have been introduced to string theory research and have solved bigger 

systems HOI 12H [Ml [3H ESI IM1 ESI I3H] - 

In this article, we consider special types of models coming from string theory in which 
the systems to be solved are binomial systems, and in which the vacuum moduli spaces 
are composed of unions of positive-dimensional components. We take a model which is 
actively investigated by string theorists because its vacuum moduli space is a combination 
of mesonic and baryonic branches mm- Such moduli spaces are called master spaces. 

In particular, we consider the superpotential for Af = 1 gauge theories for a D 3- 
brane on the Abelian orbifold C 3 /Z m x Z&. The superpotential for this theory, for fixed 
m, k £ N, is given by 


m— 1 k— 1 



(32) 


where the periodic boundary conditions are imposed, e.g., Xi y7n = Xi y o for any i and x ky j = 
xqj for any j. This is a polynomial in 3 mk variables: for the combinations 

of i £ 7* k and j £ Z m . For example, when m = k = 2, the superpotential is 



We are interested in finding the critical points of W myk , that is, points at which all the 
partial derivatives of the superpotential W„ lyk , with respect to variables Xij,yij, Zij, are 
zero. These points are precisely the solutions to the system of polynomial equation 


dW. myk = dW myk _ dW myk 
dx it j dyij dz iyj 


(33) 


in the variables Xij, yij, Zij. 

Notice that in W myk , each variable appears in exactly two distinct terms. Consequently, 
the partial derivative of W myk with respect to each variable consists of exactly two terms, 
hence it forms a binomial polynomial. For instance, 


dW 2}2 

dx 0y0 


— y i,o~i,i — ?/o,iZi,o, —— — —2/o, o^i.i + 2/i, i^i, o- 

ox 0.1 


dW‘2.2 


■ 0,1 
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m/k 

1 

2 

3 

4 

5 

6 

7 

8 

1 

N/A 

4 

5 

6 

7 

8 

9 

10 

2 

4 

6 

8 

10 

12 

14 

16 

18 

3 

5 

8 

11 

14 

17 

20 

23 

26 

4 

6 

10 

14 

18 

22 

26 

30 

34 

5 

7 

12 

17 

22 

27 

32 

37 

42 

6 

8 

14 

20 

26 

32 

38 

44 

50 

7 

9 

16 

23 

30 

37 

44 

51 

58 

8 

10 

18 

26 

34 

42 

50 

58 

66 


Table 2. The dimension of V* (VWm,k) for a range of values for m and k. 


m = k 

9 

10 

11 

12 

13 

14 

15 

20 

25 

30 

35 

40 

Dim. 

83 

102 

123 

146 

171 

198 

227 

402 

627 

902 

1227 

1602 


Table 3. The dimension of V*(X7W m ,k) for a range of larger values for m = k. 

Therefore (33) is indeed a binomial system which shall simply be denoted by VW mj fc. We 
are interested in computing the dimension and degree of components of the C*-solution 
set V*(VW / mj fc) of this system. 

The dimension and the degree of the top dimensional components of this system was 
first computed in m for up to m = 3 and k = 5 using the Grobner basis method. 
Later on, in 1361 , the dimensions and degrees of all the components for up to m = 3 = k 
were carried out using numerical algebraic geometry methods. The parallel GPU-based 
implementation of the binomial solver we have proposed can compute, very quickly, the 
dimension and the global parametrization of the C*-solution set for higher values of to and 
k. Table 2 and 3 show the dimension of V*(V4U mj fc) C (C*) 3mfc for a range of values for m 
and k. More importantly, our implementation shows impressive efficiency in computing 
the degree of V*(VfU mi fc) for larger values of m and k, including a component of degree 
as high as 50467100, for m = 4 and k = 5. Table 4 shows the degree of V*(VW mj fe) for a 
range of m and k values which is a significant expansion of the existing results presented 
in [Til I36| and a substantial improvement over the existing algorithm outlined in [B]. 

Table 5 shows the speedup ratio achieved by the GPU-based algorithm, presented in 
§4, over its closest CPU-based implementation MixedVol-2.0 (2Sj which is widely regarded 
as one of fastest serial software program for computing “mixed volume” (See Remark 2 
and Appendix A for its connection with degree computation considered in this article). 
Remarkably, with sufficient GPU threads nearly 30 fold speedup ratio has been achieved 
by the double-precision version of the algorithm. When the single-precision version is 
used, even higher speedup ratio can be achieved, Unfortunately, it appears that single¬ 
precision is, in general, not reliable in handling very large problems due to its insufficient 
precision. 

More important to note is the great potential of the GPU-based algorithm when mul¬ 
tiple GPU devices are used. Table 6 shows the speedup ratio achieved by multiple GPU 
devices when compared to a single GPU, a single CPU, and a small cluster of 100 nodes. 


22 
















































m/k 

1 

2 

3 

4 

5 

6 

7 

8 

1 


2 

4 

8 

16 

32 

64 

128 

2 

2 

14 

92 

584 

3632 

22304 

135872 

823424 

3 

4 

92 

1620 

26762 

437038 

7029180 

111135118* 

> 100100328 

4 

8 

584 

26762 

1169876 

50467100 

> 11907022 

> 37567994 


5 

16 

3632 

437038 

50467100 

> 99710106 

> 62944504 



6 

32 

22304 

7029180 

> 11907022 

> 62944504 




7 

64 

135872 

111135118* 

> 37567994 





8 

128 

823424 

> 100100328 







Table 4. The degree of the C*-solution set defined by (32) for a range of m and k values. This 
table lists only the results that can be computed within 1 hour on a NVidia GTX 780 graphics 
card with the double-precision version of the GPU-based parallel algorithm presented in this 
work. Shaded entries correspond to and agree with the results already presented in ill) 1361 . 
Entries marked by * are results that cannot be computed with any CPU-based program within 
a reasonable amount of time (2 days for multi-core systems and 7 days for clusters). Entries 
marked by > are lower bounds of the degrees computed by counting the total number of cells 
in the simplicial subdivision of the polytope associated with (32). 


GPU threads 

DP Speedup ratio 

SP Speedup ratio 

64 

0.00 

0.00 

128 

0.00 

0.00 

256 

0.00 

0.00 

512 

0.91 

0.73 

1024 

0.98 

4.14 

2048 

1.15 

6.66 

4096 

2.20 

10.99 

8192 

4.01 

18.71 

2 14 

7.99 

35.00 

2 15 

15.00 

40.10 

2 16 

16.33 

45.33 

2 17 

29.47 

44.99 

2 18 

28.33 

41.06 


Table 5. Speedup ratios achieved by the GPU-based double-precision (DP) and single-precision 
(SP) algorithm respectively on NVidia GTX 780 when compared to MixedVol-2.0. “0.00” repre¬ 
sents speedup ratios too small to be measured reliably. The number of threads are chosen to be 
multiples of 32 which is the “warp size” (smallest group of threads in CUDA framework). 


With three GPU devices, over 60 fold speedup over the single-threaded CPU-based al¬ 
gorithm (MixedVol-2.0) has been achieved. The most surprising result is the comparison 
between the GPU-based algorithm, developed in this article, running on three GPU 
devices and a similar CPU-based algorithm running on a small cluster. MixedVol-3 is 
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N.o. devices 

1 

2 

3 

Speedup over single device 

100% 

188% 

213% 

Max. speedup over CPU 

28.33 

54.12 

61.00 

Max. speedup over a cluster of 100 nodes 

0.48 

0.91 

1.04 


Table 6. Speedup ratios achieved by using multiple identical NVidia GTX 780 devices with the 
single device performance (using the same algorithm) as a reference. 


a parallel version of MixedVol-2.0 [213 an d, now, a part of a larger software program 
Hom4PS-3 [2]. With three NVidia GTX 780 our GPU-based algorithm computes the de¬ 
gree for V*(VMU.s) faster than MixedVol-3 on a small cluster totaling 100 Intel Xeon 
2.4Ghz processor cores. 

In computing the degrees of V* (VW m ^) for certain larger rn and k. while the GPU 
based algorithm was able to obtain the simplicial subdivisions of the polytopes associated 
with V*(VW TOj fc), the volumes of certain cells, which are given as a matrix determinant 
(16), could not be computed with sufficient accuracy to ensure the exactness of the 
answer. However, since the volume of each cell is at least one, the total number of 
cells is therefore a lower bound of the degree which equals the total volume of all the 
cells. Although these lower bounds are likely to be much smaller than the actual degrees, 
given the sheer size of these systems, these partial results still merit further investigations 
and improvements on the approach presented here. The lower bounds are therefore also 
included in Table 4 (entries marked with “>”). 

While the rigorous analysis and physical interpretation of the data presented here are 
outside the scope of this article, the rich set of data shown in Tables 2, 3, and 4 appear 
to show some general pattern. To motivate further research in this important problem, 
we summarize these patterns in the form of a conjecture: 

Conjecture 1. In general, for to, k £ Z + with to ^ k, the solution set V*(VW tTlj fe) 
consists of a single component of dimension 

dimV*(VW TO) fe) = mk + 2. 

Furthermore, for m = 1 and m = 2, the degree of the solution set is given by 
deg V*(VWi, fc ) = 2 • deg V^VWi.fc.r) = 2 fc ~ 1 

k—2 

deg V*(VIU 2 , fc ) = 6 • deg V*(VH / 2; fe_ 1 ) + 2 2k ~ 3 = 2 • 6 fc_1 + ^2 2 ( fe -^“ 3 • & 

3=0 


7. Conclusion 

This this article, we proposed a parallel algorithm for computing the degree of com¬ 
ponents of a C*-solution set defined by a binomial system that is specifically designed 
for GPU devices. Numerical experiments with the CUDA based implementation shows 
remarkable performance and scalability when applied to the binomial systems of the 
master space of AT = 1 gauge theories. 
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Appendix A Kushnirenko’s theorem 

Theorem 1 (Kushnirenko [27] L Consider the system of k Laurent polynomial equations 


a^ 1 ) 

Cl,lX a 

+ Ci j2 x a(2) 

+ ■ ■ 

a W 

• + Cl’kX 

= 0 

a (1 > 

C2,lX a 

+ C2,2®° <21 

+ ■ ■ 

a «) 

• + C 2 ,kX a 

= 0 

„(1) 

„(2) 



= 0 

Ck,lX a 

+ C k ,2X a 

+ ' 

~b Ck,kX 


in k variables x = [x\,.. . , Xk) in which every equation has the same set of monomials 
determined by exponent vectors a (1 ),. .., a £ Z fc . With “generic” coefficients Cij £ C*, 
the solutions of this system in (C*) fe are all isolated and nonsingular. The total number 
of these isolated solutions is 

fc! • VoU(conv{a (1 \ ..., a (f ^}). 

This important result was later generalized significantly in [lj where the number of 
isolated nonzero C*-solutions of a system of Laurent polynomial equation is shown to 
be equal to the mixed volume of the Newton polytopes of the system. This number is 
now commonly known as the BKK bound of a Laurent polynomial system. Therefore, 
the degree computation discussed above can be considered as a special case of the BKK 
bound i.e. mixed volume computation. 
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